Add file metadata columns support for spark parquet #7880
Conversation
@majetideepak, could you also help review? I also saw that @aditi-pandit submitted a similar PR, #8800, addressing the Presto engine.
@gaoyangxiaozhu: I have a couple of high-level comments. There are several missing pieces: i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in the HiveConnector TestBase classes. It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. What do you think?
Thanks @aditi-pandit. For i): yes, a test is needed, and I saw your PR already has one. For iii): I didn't previously know what "synthesized" means, and I couldn't find any code path that uses it, so I just added a kMetadata to easily mark a column as a metadata column. We could still use "synthesized"; I preferred the "metadata" naming only because I didn't understand what "synthesized" means. Sure, @aditi-pandit, it's fine if you update your PR to use the metadata_columns parameters and remove any hard-coded checks for whether a column is metadata or not. For the filter part, just reference that change.
…me' to be queried in SQL (#8800)

Summary: $file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has a number of such queryable synthesized columns (#7880). The columns are passed by the coordinator to the worker in the HiveSplit. i) Velox HiveSplit needed to be enhanced to receive the file size and file_modified_time metadata from Prestissimo in a generic (column name, value) map data structure. ii) These values should be populated by SplitReader into the TableScanOperator output buffers. This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment: prestodb/presto#21965. Fixes prestodb/presto#21867. gaoyangxiaozhu will have a follow-up PR on the Spark integration.

Pull Request resolved: #8800
Reviewed By: mbasmanova
Differential Revision: D54512245
Pulled By: Yuhta
fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e
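The commit above describes carrying metadata columns on the split as a generic (column name, value) map. A minimal sketch of that idea, using a simplified stand-in struct (the real class is Velox's HiveConnectorSplit and has many more fields; the names here are assumptions, not the actual API):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a Hive split carrying precomputed metadata
// column values, as described in the commit summary. Not the real Velox API.
struct SimpleHiveSplit {
  std::string filePath;
  // Values such as $file_size / $file_modified_time, computed by the
  // coordinator and passed down as (column name -> value) pairs.
  std::unordered_map<std::string, std::string> metadataColumns;
};

// Look up a metadata column's value; returns an empty string if absent.
std::string metadataValue(const SimpleHiveSplit& split, const std::string& name) {
  auto it = split.metadataColumns.find(name);
  return it == split.metadataColumns.end() ? std::string{} : it->second;
}
```

The key design point in the commit is that the map is generic: the worker does not hard-code which metadata columns exist, so Presto and Spark can each pass their own set.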
Closing, since @aditi-pandit's PR, which does the same thing, has been merged.
Spark supports querying file metadata such as file_size, file_name, file_path, file_modified_time, file_block_start, etc. for Hive tables as separate file metadata columns. See https://github.com/apache/spark/blob/081c7a7947a47bf0b2bfd478abdd4b78a1db3ddb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L183C2-L193C56 for all the file metadata columns Spark supports querying.

This PR extends HiveSplit with a new parameter, metadataColumns, to let an upstream compute engine such as Spark pass the initialized constant file metadata columns (if any) to the Velox connector split when it is constructed. This fixes an issue where file metadata columns come back null when integrating with Velox from Spark. See issue #8173 for the detailed context.
It is also a dependency of Gluten repository PR apache/incubator-gluten#3870
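Once the split carries its metadata map, the reader can emit each requested metadata column as a constant value repeated for every row of the scan output, since the value is the same for all rows coming from one file. A hedged sketch of that step (the function name and the use of plain strings and vectors are illustrative assumptions; the real SplitReader works with Velox vectors):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only, not the real Velox SplitReader API: materialize one metadata
// column for a batch of numRows rows by repeating the per-file constant value.
std::vector<std::string> fillMetadataColumn(
    const std::unordered_map<std::string, std::string>& metadataColumns,
    const std::string& columnName,
    size_t numRows) {
  auto it = metadataColumns.find(columnName);
  // A missing entry would surface as nulls in the real engine; we use empty
  // strings here to keep the sketch simple.
  const std::string value =
      it == metadataColumns.end() ? std::string{} : it->second;
  return std::vector<std::string>(numRows, value);
}
```

In the real implementation such a column would typically be a constant-encoded vector rather than a materialized copy per row, which keeps the memory cost independent of batch size.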